European Radiology
○ Springer Science and Business Media LLC
Preprints posted in the last 30 days, ranked by how well they match European Radiology's content profile, based on 11 papers previously published here. The average preprint has a 0.14% match score for this journal, so anything above that is already an above-average fit.
Lettner, J. D.; Evrenoglou, T.; Binder, H.; Fichtner-Feigl, S.; Neubauer, C.; Ruess, D. A.
Background: AI-based radiomics has demonstrated promising diagnostic performance for pancreatic cystic neoplasms, yet clinical translation remains limited. Whether this reflects insufficient model performance or structural limitations of the evidence base remains unclear. Methods: We performed a systematic review and diagnostic test accuracy meta-analysis of AI-based radiomics in pancreatic cysts (2015-2025), addressing two clinically relevant tasks (Q1: cyst type differentiation; Q2: malignancy or high-grade dysplasia prediction). Training and validation datasets were synthesized independently using hierarchical models. Study evaluation extended beyond diagnostic performance to a four-dimensional framework integrating RQS 2.0, METRICS, TRIPOD+AI, and PROBAST+AI, explicitly contrasting pooled diagnostic performance with reporting quality, methodological rigor, and risk of bias. The review was pre-registered (PROSPERO) and conducted according to PRISMA 2020. Results: Twenty-nine studies were included (Q1: n = 15; Q2: n = 14), predominantly retrospective and single center. Training-based analyses showed high apparent diagnostic performance for Q1 (pooled sensitivity/specificity: 0.89 [95% CI, 0.85-0.92]/0.90 [0.85-0.93]), but there was substantial heterogeneity (τ² = 0.56/0.78; ρ = 0.38). Validation-based performance remained high (0.86 [0.82-0.89]/0.88 [0.81-0.93]), while heterogeneity persisted and prediction regions exceeded confidence regions. Training-based analyses demonstrated similarly high apparent performance (0.88 [0.79-0.95]/0.89 [0.81-0.94]) for Q2, with pronounced heterogeneity (τ² = 1.98/1.61; ρ = 0.63). Validation-based performance was slightly lower, yet still clinically comparable (0.82 [0.75-0.89]/0.86 [0.80-0.91]), and heterogeneity persisted (τ² = 0.71/0.43; ρ = 0.15).
Across both tasks, high diagnostic accuracy occurred alongside incomplete reporting, limited validation, and an elevated risk of bias. Conclusion: AI-based radiomics for pancreatic cysts has reached a structural performance plateau. Further improvements in diagnostic accuracy alone are insufficient to achieve clinical translation and must be accompanied by a paradigm shift from performance-driven model development toward decision-anchored study designs, robust validation strategies, transparent reporting standards, and clinically integrated evaluation frameworks. Summary: Although pancreatic cystic lesions are increasingly being detected, imaging-based decision-making remains limited, particularly regarding differentiating between cyst types and stratifying malignancy risk. In this PRISMA-compliant and PROSPERO-registered systematic review and meta-analysis of diagnostic tests, we evaluated the use of AI-based radiomics for these two tasks, as well as its contextualized performance. In addition, a four-dimensional framework was employed to conduct the evaluation, incorporating diagnostic accuracy, reporting quality, risk of bias, and radiomics maturity. Across studies published between 2015 and 2025, the pooled diagnostic performance was consistently high, with only modest declines observed from the training to the validation stage. Nevertheless, considerable heterogeneity between studies and limited transportability remained evident. Multidimensional evaluation indicated a systematic dissociation between reported performance and methodological robustness, characterized by incomplete reporting, restricted validation, and an elevated risk of bias. These limitations were consistent across both clinical questions and were not resolved by increasing model complexity. The findings of this meta-analysis suggest that the structural performance of AI-based radiomics for pancreatic cysts has plateaued.
To progress towards clinical translation, it is necessary to employ study designs anchored in decision-making processes, robust multi-center validation, and transparent, reproducible evaluation frameworks. This is preferred to further optimization of model architecture alone.
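One way to see why prediction regions can exceed confidence regions is to translate a between-study variance into a range of plausible study-level sensitivities. The sketch below is a back-of-the-envelope illustration, not the authors' hierarchical model: it assumes that τ² is reported on the logit scale, ignores both the uncertainty of the pooled estimate and the sensitivity-specificity correlation, and uses a normal quantile of 1.96. The function name `approx_prediction_range` is hypothetical.

```python
import math


def logit(p):
    """Map a proportion in (0, 1) to the log-odds scale."""
    return math.log(p / (1 - p))


def expit(x):
    """Inverse of logit: map log-odds back to a proportion."""
    return 1 / (1 + math.exp(-x))


def approx_prediction_range(pooled, tau2, z=1.96):
    """Approximate range of true study-level values.

    Assumption (not stated in the abstract): tau2 is the between-study
    variance on the logit scale. Uncertainty in the pooled mean is ignored,
    so this understates a formal prediction interval.
    """
    center = logit(pooled)
    half_width = z * math.sqrt(tau2)
    return expit(center - half_width), expit(center + half_width)


# Q1 training-based pooled sensitivity and its tau^2 from the abstract.
low, high = approx_prediction_range(0.89, 0.56)
print(f"approximate range: {low:.2f} to {high:.2f}")
```

Under these assumptions the range for the Q1 training-based sensitivity is roughly 0.65 to 0.97, far wider than the reported 95% CI of 0.85-0.92, which is the pattern the abstract describes.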
Choi, H.; Bae, S.; Na, K. J.
Background: Although deep learning models have improved individual PET analysis, image processing and quantification tasks, end-to-end automation from raw DICOM to quantitative clinical reporting remains limited, particularly in heterogeneous real-world settings. Methods: As a proof-of-concept, an autonomous large language model (LLM)-orchestrated multi-tool agent for end-to-end PET/CT interpretation was developed. A reasoning-based text LLM selected appropriate series from raw DICOM, coordinated registration and SUV conversion, invoked segmentation and detection tools, generated maximum-intensity projections, called a vision-enabled LLM for interpretation, and synthesized structured draft reports. The system was retrospectively evaluated in 170 patients undergoing baseline FDG PET/CT for lung cancer staging, using expert reports as reference. Results: The agent successfully completed the full end-to-end workflow from raw DICOM selection to structured draft report generation without human intervention in all 170 examinations. Primary tumor detection achieved 100% sensitivity. For nodal involvement, sensitivity was 84.8% and specificity was 39.4%, whereas distant metastasis detection showed 70.2% sensitivity and 65.0% specificity. Discrepancy analysis of 58 nodal and 57 metastatic mismatch cases revealed systematic false-positive findings related to reactive or physiologic uptake and false-negative findings involving small-volume or anatomically atypical metastases. Conclusion: LLM-orchestrated PET/CT agents can enable workflow-level automation from raw DICOM to quantification and structured draft reporting under real-world conditions. Although primary tumor detection was highly reliable, nodal and metastatic assessment revealed systematic limitations, supporting a collaborative role with continued expert oversight in complex clinical scenarios.
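For readers less familiar with how sensitivity and specificity follow from a comparison against reference reports, a minimal sketch is given here. The abstract does not report the underlying confusion-matrix counts, so the counts used below are invented purely for illustration and are not the study's data; the function name is likewise hypothetical.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-matrix counts.

    sensitivity = TP / (TP + FN): share of reference-positive cases detected.
    specificity = TN / (TN + FP): share of reference-negative cases cleared.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


# Hypothetical counts, NOT taken from the study.
sens, spec = sensitivity_specificity(tp=8, fn=2, tn=3, fp=7)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

A pattern like this one, with high sensitivity and low specificity, corresponds to many false-positive calls, which matches the abstract's description of reactive or physiologic uptake being flagged as nodal disease.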
Sahin, S.; Diaz, E.; Rajagopal, A.; Abtahi, M.; Jones, S.; Dai, Q.; Kramer, S.; Wang, Z.; Larson, P. E. Z.
Current standard of care imaging practices cannot reliably differentiate among certain renal tumors such as benign oncocytoma and clear cell renal cell carcinoma (RCC), and between low and high grade RCCs. Previous work has explored using deep learning, radiomics, and texture analysis to predict renal tumor subtypes and differentiate between low and high grade RCCs with mixed success. To further this work, large diverse datasets are needed to improve model performance and provide strong evaluation sets. In this work, a dataset of 831 multi-phase 3D CT exams was curated. Each exam contains up to three contrast-enhanced CT phases. Tumor outlines or bounding boxes were annotated and registered to the image volumes. The pathology results for each tumor and relevant patient metadata are also included.
Kästingschäfer, K. F.; Fink, A.; Rau, S.; Reisert, M.; Kellner, E.; Nolde, J. M.; Kottgen, A.; Sekula, P.; Bamberg, F.; Russe, M. F.
Rationale and Objectives: Contrast-enhanced (CE) MRI provides clear corticomedullary contrast for renal compartment delineation but may be contraindicated or undesirable in routine practice. We aimed to enable automated extraction of renal imaging biomarkers from routine non-contrast-enhanced (NCE) T1-weighted MRI by transferring CE-derived compartment labels. Materials and Methods: This retrospective single-center study (January 2017 to December 2021) included 200 participants with paired arterial-phase CE and NCE T1-weighted MRI. Cortex, medulla, and sinus were manually segmented on CE MRI and rigidly transferred to NCE MRI to provide voxel-level reference labels. A hierarchical 3D Deep Neural Patchworks model was trained on 100 examinations (90 training/10 validation) and evaluated on an independent test set of 100 examinations using the transferred CE masks on NCE as reference. Performance was assessed using Dice similarity of segmentations and biomarker agreement using volumes and surface areas (Pearson/Spearman, MAE, Lin's CCC, and Bland-Altman). Results: Whole-kidney segmentation Dice was 0.950 (left) and 0.953 (right). Total kidney volume showed high agreement with minimal bias (MAE 8.76 mL, 2.5% of mean; CCC 0.983; bias -1.56 mL; 95% limits of agreement -28.81 to 25.69 mL). Cortex volume was modestly overestimated and medulla volume underestimated, shifting predicted compartment fractions toward cortex (74.7% vs. 72.1% in ground truth; medulla 21.5% vs. 24.3%; sinus 3.8% vs. 3.6%). Sinus volume maintained high concordance despite higher Dice dispersion. Surface area was systematically underestimated with low concordance. Conclusion: CE-supervised knowledge transfer enables accurate, well-calibrated kidney volumetry from routine NCE MRI and supports contrast-free renal biomarker extraction. Surface area estimation remains challenging.
Take-home Messages: CE-supervised label transfer enables accurate, well-calibrated contrast-free kidney volumetry on routine non-contrast T1-weighted MRI. Compartment volumetry is feasible but shows systematic cortex overestimation and medulla underestimation; surface area remains non-interchangeable due to boundary uncertainty.
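The agreement statistics named in this abstract can be sketched in a few lines. The code below is a minimal illustration using only the Python standard library; it assumes the conventional definitions (Lin's concordance correlation coefficient with 1/n variances, and Bland-Altman limits of agreement as bias ± 1.96 sample standard deviations of the differences), which may differ in detail from what the authors used. The function names and the toy volumes are hypothetical.

```python
from statistics import mean, stdev


def lins_ccc(x, y):
    """Lin's concordance correlation coefficient (1/n variance convention)."""
    n = len(x)
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x) / n
    syy = sum((b - my) ** 2 for b in y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    # Penalizes both poor correlation and systematic shift between methods.
    return 2 * sxy / (sxx + syy + (mx - my) ** 2)


def bland_altman(x, y, z=1.96):
    """Return (bias, lower limit, upper limit) for paired measurements."""
    diffs = [a - b for a, b in zip(x, y)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    return bias, bias - z * sd, bias + z * sd


# Toy volumes in mL (hypothetical, not study data).
predicted = [310.0, 342.0, 298.0, 365.0]
reference = [312.0, 340.0, 301.0, 366.0]
print(lins_ccc(predicted, reference))
print(bland_altman(predicted, reference))
```

If the reported limits of agreement (-28.81 to 25.69 mL around a bias of -1.56 mL) were computed this way, they would imply a standard deviation of the differences of about 13.9 mL.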
Castelo, A.; O'Connor, C.; Gupta, A. C.; Anderson, B. M.; Woodland, M.; Altaie, M.; Koay, E. J.; Odisio, B. C.; Tang, T. T.; Brock, K. K.
Artificial intelligence (AI)-based segmentation has many medical applications, but the limited availability of curated datasets challenges model training; this study compares the impact of dataset annotation quality and quantity on whole-liver AI segmentation performance. We obtained 3,089 abdominal computed tomography scans with whole-liver contours from MD Anderson Cancer Center (MDA) and a MICCAI challenge. A total of 249 scans were withheld for testing, of which 30 (MICCAI challenge data) were reserved for external validation. The remaining scans were divided into mixed-curation and highly curated groups, randomly sampled into sub-datasets of various sizes, and used to train 3D nnU-Net segmentation models. Dice similarity coefficients (DSC), surface DSC with 2 mm margins (SD 2mm), the 95th percentile of Hausdorff distance (HD95), and 2D axial slice DSC (Slice DSC) were used to evaluate model performance. The highly curated 244-scan model (DSC=0.971, SD 2mm=0.958, HD95=2.98mm) did not differ significantly on 3D evaluation metrics from the mixed-curation 2,840-scan model (DSC=0.971 [p>.999], SD 2mm=0.958 [p>.999], HD95=2.87mm [p>.999]). The 710-scan mixed-curation model (Slice DSC=0.929) significantly outperformed the highly curated 244-scan model (Slice DSC=0.923 [p=0.012]) on the 30 external scans. Highly curated datasets yielded equivalent performance to datasets that were a full order of magnitude larger. The benefits of larger, mixed-curation datasets are evidenced in model generalizability metrics and local improvements. In conclusion, tradeoffs between dataset quality and quantity for model training are nuanced and goal dependent.
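As a reminder of what the volumetric Dice similarity coefficient measures, here is a minimal sketch on flattened binary masks. It uses plain Python lists rather than image arrays and is not the evaluation code used in the study; the function name and the convention of returning 1.0 for two empty masks are assumptions made for this example.

```python
def dice_coefficient(pred, ref):
    """Dice similarity coefficient for two equal-length binary masks (0/1).

    DSC = 2 * |pred AND ref| / (|pred| + |ref|).
    """
    if len(pred) != len(ref):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, r in zip(pred, ref) if p and r)
    total = sum(1 for p in pred if p) + sum(1 for r in ref if r)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2 * intersection / total


# Four-voxel toy masks: one voxel overlaps out of two labeled in each mask.
print(dice_coefficient([1, 1, 0, 0], [1, 0, 1, 0]))
```

The slice-wise variant mentioned in the abstract would apply the same formula to each 2D axial slice separately, which is one reason it can expose local errors that a whole-volume score averages away.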
McCullum, L.; Ding, Y.; Fuller, C. D.; Taylor, B. A.
Background and Purpose: Magnetic resonance imaging (MRI) for radiation therapy treatment planning is currently being used in many anatomical sites to better visualize soft tissue landmarks, a technique known as MRI simulation. A core component of modern MRI simulation configurations is the use of external laser positioning systems (ELPS) to help set up the patient. Though necessary for accurate and reproducible patient setup, the ELPS, if left on during imaging, may degrade image quality by leaking electronic noise, to which MRI is sensitive. It is currently unknown whether this leakage of electronic noise may further affect quantitative values derived from clinically employed relaxometric, diffusion, and fat fraction sequences. Therefore, in this study, we aim to characterize the impact of MRI simulation lasers on general image quality and quantitative imaging accuracy. Materials and Methods: First, a cine acquisition was used to visualize the real-time changes in image signal-to-noise ratio (SNR) from when the ELPS was deactivated to activated. To validate this effect quantitatively, the SNR was measured using the American College of Radiology (ACR) recommended protocol in a homogeneous phantom with the integrated body, 18-channel UltraFlex small, 18-channel UltraFlex large, 32-channel spine, and 16-channel shoulder coils. Next, a geometric distortion algorithm was tested in two vendor-provided phantoms using the integrated body coil, and the ACR Large Phantom protocol was tested. Finally, a series of quantitative MRI scans were performed using a CaliberMRI Model 137 Mini Hybrid phantom to validate quantitative T1, T2, and ADC, while a Calimetrix PDFF-R2* phantom was used for quantitative PDFF and R2*. All scans were performed with the ELPS both deactivated and activated.
Results: Visible electronic noise artifacts were seen on the cine acquisition with the integrated body coil when the ELPS was activated, corresponding to a four-fold decrease in SNR using the ACR protocol. This SNR drop was not seen when using the remaining tested coils. The automatic fiducial detection algorithm was affected negatively by ELPS activation, leading to misidentification of fiducials that were identified perfectly with the ELPS deactivated. Degradation in image intensity uniformity, percent signal ghosting, and low contrast object detectability was seen during ACR Large Phantom testing using the 20-channel Head/Neck coil. Concordance across quantitative MRI values was similar with the ELPS deactivated and activated, whereas a consistent increase in standard deviation inside the ADC vials was seen when the ELPS was activated. Discussion: Activating the ELPS during imaging should be avoided because the electronic noise it introduces can unnecessarily increase image noise. This is particularly true when conducting mandatory quality assurance testing for image quality and geometric distortion, which utilizes the integrated body coil, the coil most susceptible to ELPS-induced noise. Clear clinical guidelines should be implemented to make this issue known to the MRI technologists, physicists, and other relevant staff using an MRI with a supplementary ELPS for patient alignment.
Bjelovucic, R.; de Freitas, B. N.; Norholt, S. E.; Taneja, P.; Terp Hoybye, M.; Pauwels, R.
Introduction: Digital technologies are reshaping how health professionals are trained, and extended reality (XR) has gained attention as a tool for skills development in dental education. Yet, successful integration depends largely on educators' perceptions, readiness, and working conditions. This study aimed to explore dental educators' views of the educational value of XR, what barriers they experience, and how familiarity with immersive technologies relates to their use in teaching. Materials and Methods: A cross-sectional, web-based survey was conducted among dental educators. The questionnaire included items on demographics, familiarity and frequency of XR use, and perceptions of educational value, barriers, and curricular integration. Descriptive statistics were calculated, and Spearman correlation analyses were performed to explore associations between familiarity, use, and perceived benefits of XR. Results: Respondents reported positive attitudes toward XR, particularly for improving students' understanding of complex anatomy (mean = 6.02/7), skill development (5.68/7), and confidence and preparedness for clinical practice (5.08-5.20/7). XR was mainly viewed as a complement to traditional teaching rather than a replacement (mean = 3.77/7). Strong correlations were observed between perceived improvements in confidence, skills, and clinical readiness (r = 0.71-0.89, P < 0.0001). High costs, limited technical support, and time constraints were the most prominent barriers to usage. Conclusion: Overall, dental educators appear open to XR but constrained by structural and organizational factors rather than a lack of interest. Faculty development, hands-on training opportunities, and institutional support may therefore be essential to translating positive perceptions into meaningful and sustained integration of immersive technologies in dental curricula.
Seo, W.; Jabur Agerberg, S.; Rashid, A.; Holmstrand, N.; Nyholm, D.; Virhammar, J.; Fallmar, D.
Introduction: Idiopathic normal pressure hydrocephalus (iNPH) is a partially reversible neurological disorder in which imaging biomarkers support diagnosis and surgical decision-making. The callosal angle (CA) is one of the most robust radiological markers of iNPH and has also been associated with postoperative shunt outcome. However, several manual measurement variants exist and artificial intelligence (AI)-based tools now enable automatic CA measurement. Materials and Methods: In total, 71 patients (40 with confirmed iNPH and 31 controls) were included. Six predefined manual methods for measuring CA were applied to preoperative 3D T1-weighted MRI and evaluated for diagnostic performance and interobserver agreement. An AI-derived automatic CA (cMRI from Combinostics) was included as a seventh method and compared with the traditional manual method (perpendicular to the bicommissural plane and through the posterior commissure). Automatic measurements were additionally assessed in pre- and postoperative scans to evaluate robustness against shunt-related artifacts. Results: All seven CA variants significantly differentiated iNPH patients from controls (p < 0.05). The traditional method showed the highest discriminative performance (AUC = 0.986, SE = 0.012), while alternative planes demonstrated slightly lower accuracy (AUC range = 0.957-0.978). Interobserver agreement for manual measurements was good to excellent (ICC = 0.687-0.977). Automatic CA measurements showed excellent correlation with the traditional method (preoperative ICC = 0.92; postoperative ICC = 0.96). Conclusion: Although several CA positions perform comparably, the traditional method remains marginally superior and is best supported by the literature. Automated CA measurements closely match expert manual assessment in pre- and postoperative imaging, supporting clinical implementation.
Wu, J.; Perandini, L.; Batra, T.; Igoshin, S.; Bari, S.; de Araujo, A. L.; Willemink, M. J.
Digital breast tomosynthesis (DBT) is a powerful imaging modality that allows for improved lesion visibility, characterization, and localization compared to conventional two-dimensional digital mammography. DBT has been increasingly adopted in screening and diagnostic settings globally, particularly for women with dense breast tissue where tissue overlap presents a significant diagnostic challenge. Here we describe DBT-2026, a real-world imaging dataset with 558 DBT exams from 558 patients with Breast Imaging Reporting and Data System (BI-RADS) scores of 0, 1, or 2. Each case contains one DBT examination in combination with expert annotations and free-text radiology reports that describe the radiological findings, produced in routine clinical practice. To protect patient privacy, all images and reports have been de-identified. The dataset is made freely available to researchers for non-commercial projects to facilitate and encourage research in breast cancer imaging.
Hartmann, K.; Beeche, C.; Judy, R.; DePietro, D. M.; Witschey, W. R.; Duda, J.; Gee, J.; Gade, T.; Penn Medicine Biobank, ; Levin, M.; Damrauer, S. M.
Purpose: Portal hypertension, a major complication of chronic liver disease, leads to significant morbidity and mortality. While portal vein diameter measured on imaging has long been proposed as a non-invasive marker of portal hypertension, normative CT-based reference values and population-level associations remain incompletely characterized. Here, we aim to define contemporary reference values for portal vein diameter on clinically obtained CT and evaluate its associations with demographic, clinical, and imaging factors, as well as its diagnostic performance for portal hypertension. Methods: We conducted a retrospective analysis of 20,225 clinically obtained CT scans at a single academic medical center. The main portal vein was automatically segmented using TotalSegmentator, and maximum diameter was extracted using the Vascular Modeling Toolkit. Associations with demographic and imaging factors were evaluated using linear mixed-effects models; prevalent liver disease and portal hypertension using logistic regression; risk of incident ascites and esophageal varices among participants with liver disease using Cox regression; and invasive hepatic venous pressures using correlation analysis and linear regression. Results: The mean portal vein diameter was 12.4 mm (95% CI, 12.37-12.45). Larger diameter was independently associated with male sex (+1.4 mm), higher BMI (+0.11 mm per kg/m²), greater height (+0.04 mm/cm), and older age (+0.05 mm/10 years) (all p < 0.001), and was substantially larger on contrast-enhanced abdomen/pelvis CT (+2.4 mm, p < 0.001). Each 1-mm increase in portal vein diameter was associated with higher odds of prevalent liver disease (OR 1.06; 95% CI, 1.04-1.08) and portal hypertension (OR 1.18; 95% CI, 1.12-1.28). Among individuals with liver disease, greater diameter predicted higher risk of incident esophageal varices (baseline diameter HR 1.50; 95% CI, 1.14-2.08) and ascites (HR per mm increase in diameter 1.06; 95% CI, 1.003-1.12).
However, portal vein diameter demonstrated weak to no association with invasively measured hepatic venous pressures. Conclusion: In this large, EHR-linked imaging cohort, the mean portal vein diameter on CT was 12.4 mm and varied with demographic and imaging factors. Larger diameter was associated with liver disease, portal hypertension, and subsequent development of varices and ascites, supporting use of portal vein diameter as a pragmatic screening or enrichment tool within multimodal clinical frameworks. Key Results: Mean portal vein diameter on routine clinical CT was 12.4 mm (95% CI, 12.37-12.45) and varied with sex, height, BMI, exam type, contrast use, and clinical setting. Each 1-mm increase in portal vein diameter was associated with higher odds of prevalent liver disease (OR 1.06) and portal hypertension (OR 1.18). Among individuals with liver disease, larger portal vein diameter predicted higher risk of incident esophageal varices and ascites, independent of demographic and imaging factors.
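Per-millimeter odds ratios are easier to interpret when rescaled to a clinically meaningful difference. The sketch below assumes the usual logistic-regression property that the log-odds are linear in diameter, so an odds ratio for a k-mm difference is the per-mm odds ratio raised to the power k. The 3-mm difference and the function name are arbitrary choices made for illustration, not values from the study.

```python
def rescale_odds_ratio(or_per_unit, units):
    """Odds ratio for a difference of `units`, assuming log-linearity."""
    return or_per_unit ** units


# Per-mm ORs from the abstract, rescaled to a hypothetical 3-mm difference.
for label, or_per_mm in [("liver disease", 1.06), ("portal hypertension", 1.18)]:
    print(f"{label}: OR per 3 mm = {rescale_odds_ratio(or_per_mm, 3):.2f}")
```

Under that assumption, the reported per-mm odds ratio of 1.18 for portal hypertension corresponds to about 1.64 for a 3-mm larger portal vein.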
Hoe, Z. Y.; Ding, R.-S.; Chou, C.-P.; Hu, C.; Lee, C.-H.; Tzeng, Y.-D.; Pan, C.-T.; Lee, M.-C.; Lee, E. K.-L.
Background: Breast cancer-related lymphedema (BCRL) is a common complication following breast cancer treatment. While lymphoscintigraphy is considered the diagnostic gold standard, it is unsuitable for routine periodic monitoring or assessment of treatment efficacy. Shear wave elastography (SWE) offers a possible alternative, but traditional modes of operation limit its potential. Proposed Solutions: The Holder-Optimized Elastography (HOE) method is introduced to eliminate pressure issues introduced by manual operation of ultrasound probes by stabilizing them above the cutis. Methods: The HOE method was used to acquire ARFI images of high-velocity areas (HVAs, with shear wave velocity greater than 7 m/s) in limbs with and without BCRL (as confirmed and characterized by lymphoscintigraphy) in two cohorts of 15 and 125 patients. Results: The HOE method enabled ARFI elastography to directly and consistently visualize the effects caused by both obstructed lymphatic vessels and intraluminal lymphatic fluid as HVAs, whereas traditional hand-held methods did not. Inter-limb differences in HVA burden showed moderate diagnostic performance for detecting BCRL and grading obstruction with modest sensitivity. However, there was systematic underestimation of both early and confluent advanced lesions. Conclusion: HOE-based HVA imaging has potential for rapid and non-invasive monitoring of lymphedema course and treatment response and may serve as a useful adjunct to existing diagnostic tools for BCRL. However, further technical refinements and quantitative analytic methods will be required to fully exploit the richer SWV information provided by HOE and to enhance the diagnostic utility of HVAs. Summary Statement: The Holder-Optimized Elastography method ("HOE" method) increases the diagnostic capability of ARFI elastography for breast cancer-related lymphedema, allowing for the non-invasive detection of some lymphatic obstructions but not all.
Key Results: The Holder-Optimized Elastography (HOE) method revealed the effects caused by fluid-filled lymphatic vessels as "High-Velocity Areas" (HVAs), which are difficult to detect by conventional methods. HVA counts for detecting lymphedema (any obstruction vs. no obstruction) showed high specificity (0.86-1.00) but low sensitivity (0.57-0.67). Conversely, HVA counts for staging lymphedema (i.e., total vs. partial obstruction) showed high sensitivity (up to 1.00) but low specificity (0.48-0.66). The inter-limb difference of HVAs counted in whole-limb scans between affected and unaffected limbs (i.e., the "Global Mean Difference") provided the most balanced diagnostic performance (sensitivity 0.67-0.79, specificity 0.88-0.89).
Fink, A.; Burzer, F.; Sacalean, V.; Rau, S.; Kaestingschaefer, K. F.; Rau, A.; Koettgen, A.; Bamberg, F.; Jaenigen, B.; Russe, M. F.
Background: Kidney volumetry derived from CT has been proposed as a surrogate of renal function in living kidney donor evaluation. However, clinical integration has been limited by reader-dependent workflows and semiautomatic methods susceptible to image quality. Purpose: To evaluate whether fully automated CT-based segmentation of renal cortex, medulla and total parenchymal volume provides reproducible volumetric biomarkers associated with global and split renal function in living kidney donor candidates. Materials and Methods: In this retrospective single-center study, 461 living kidney donor candidates (2003-2021) underwent contrast-enhanced abdominal CT. A convolutional neural network was trained to automatically segment cortical, medullary, and total parenchymal volumes on arterial-phase images. Segmentation performance was evaluated against manual reference annotations. Volumes were indexed to body surface area. Associations with eGFR, 24-hour creatinine clearance, cystatin C, and tubular clearance were assessed using the Spearman correlation coefficient (ρ), and side-specific volume fractions were compared with scintigraphy-derived split function. Results: Automated segmentation achieved excellent agreement with expert reference segmentations (Dice 0.95 for cortex; 0.90 for medulla). eGFR correlated moderately with cortical (ρ = 0.46) and total parenchymal volume (ρ = 0.45), and modestly with medullary volume (ρ = 0.30). Similar associations were observed for other global measures, with the strongest correlation for cortical volume and tubular clearance (ρ = 0.53). Side-specific volume fractions correlated with scintigraphy-derived split renal function (ρ = 0.49-0.56; all p < 0.001). Conclusion: Automated CT-based renal subcompartment segmentation provides reproducible volumetric biomarkers within routine donor evaluation.
Cortical volume performs comparably to total parenchymal volume and tracks split renal function at the cohort level, suggesting potential utility in donor assessment.
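The Spearman coefficient used for these volume-function associations is the Pearson correlation of the ranks. A minimal standard-library sketch follows, with tied values given their average rank; it is an illustration of the statistic, not the authors' analysis code, and the toy volume and eGFR values are invented.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman rho as the Pearson correlation of average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical indexed cortical volumes (mL/m^2) and eGFR values.
volumes = [95.0, 110.0, 102.0, 120.0, 88.0]
egfr = [82.0, 97.0, 101.0, 105.0, 79.0]
print(round(spearman(volumes, egfr), 3))
```

Because it depends only on ranks, the coefficient captures monotonic association without assuming that renal function scales linearly with volume.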
Mahfouz, M.; Alzaben, E.
Background: Canine impaction represents one of the most challenging clinical scenarios in orthodontic practice, with maxillary canines being the second most commonly impacted teeth after third molars. The management of impacted canines through orthodontic traction requires an advanced understanding of biomechanical principles, surgical techniques, and patient-specific factors. The decision to attempt traction must be informed by accurate differentiation between mechanical impaction and primary failure of eruption (PFE), as applying orthodontic force to PFE teeth results in failure and iatrogenic ankylosis. Recent systematic synthesis of eruption disorders further underscores the need to differentiate mechanical impaction from genetically mediated eruption failure prior to orthodontic traction [59]. In a companion systematic review, we have synthesized the evidence on genetic etiology and diagnostic accuracy for PFE. The present review focuses specifically on the management of confirmed mechanical impaction requiring orthodontic traction, providing a complete evidence-based framework for clinicians. Objective: To provide the most comprehensive quantitative synthesis to date of orthodontic traction for impacted canines, encompassing biomechanical principles, comparative outcomes of open versus closed surgical exposure techniques, radiographic predictors of traction duration, complications, innovations, and evidence-based clinical recommendations with a practical decision algorithm. Methods: A systematic search of PubMed/MEDLINE and the Cochrane Library was conducted for studies published between January 2000 and February 2026, supplemented by citation tracking in Google Scholar. The PRISMA 2020 guidelines were followed. The protocol was prospectively registered on the Open Science Framework (DOI: 10.17605/OSF.IO/3UDH6).
Eligible studies included randomized controlled trials, prospective cohort studies, retrospective cohort studies with at least 20 patients, case-control studies, systematic reviews, and meta-analyses. Risk of bias was assessed using ROBINS-I, RoB 2.0, and ROBIS tools. Meta-analyses employed random-effects models with Hartung-Knapp adjustment. Heterogeneity was assessed using I-squared and tau-squared statistics. Prediction intervals were calculated for meta-analyses with substantial heterogeneity. The GRADE framework evaluated evidence quality. Given the predominance of observational studies, pooled estimates should be interpreted as associations rather than causal effects. Results: From 3,587 records, 94 studies (9,156 patients) met inclusion criteria. Optimal force magnitudes range from 50 to 150 g, with force direction determined by the center of resistance located halfway along the root length. Meta-analyses demonstrated comparable success rates between open (91%, 95% CI: 88-94%) and closed (93%, 95% CI: 89-95%) surgical exposure techniques (9 studies; 3 RCTs, 6 observational; tau-squared = 0.00). Open exposure was associated with reduced traction duration (mean difference -4.7 months, 95% CI: -7.3 to -2.1; I-squared = 87%, tau-squared = 5.82; prediction interval -9.8 to 0.4 months) and lower ankylosis risk (OR 0.15, 95% CI: 0.03-0.83; I-squared = 0%, tau-squared = 0.00). Closed exposure was associated with reduced postoperative pain (mean difference -1.9 VAS, 95% CI: -2.6 to -1.2; I-squared = 0%, tau-squared = 0.00). Radiographic predictors include alpha-angle (beta = 0.16 months/degree), d-distance (beta = 1.20 months/mm), and sector location. Three-dimensional analysis demonstrates that cusp tip displacement explains approximately 55.4% of variance in traction duration.
Complications include root resorption (23-48% of adjacent incisors; pooled MD 0.69 mm, 95% CI: 0.58-0.80 mm), alveolar bone loss (pooled MD 0.51 mm, 95% CI: 0.31-0.72 mm), and ankylosis (3.5-14.5%). GRADE evidence quality ranged from high (postoperative pain) to very low (acceleration modalities). Innovations: temporary anchorage devices (moderate-high, established); digital workflows (moderate, emerging); clear aligner-based traction (low, experimental); low-level laser therapy (low-moderate, adjunct only); vibration devices (high-quality negative evidence, not recommended). Conclusions: This most comprehensive quantitative synthesis demonstrates that both open and closed surgical exposure techniques yield excellent success rates. Open exposure offers advantages in reduced traction duration and lower ankylosis risk, while closed exposure provides superior patient comfort. Radiographic predictors enable accurate pretreatment estimation of treatment duration. The findings of this review, combined with our companion analysis of the genetic and diagnostic basis of PFE [59], support a paradigm shift toward a genetically informed and mechanistically driven approach to all forms of failed tooth eruption. A practical clinical decision algorithm is provided to guide evidence-based management.
Nagar, S. S.; Chandra, R. V.; Aileni, A. R.; Goud, V. S.
Aim and Objectives: The study aimed to evaluate the effectiveness of titanium inserts for interdental papilla reconstruction, comparing them with the Han and Takei technique using subepithelial connective tissue grafts. The objectives included assessing the black triangle height, papilla height, and papilla presence index (PPI) at baseline, 1 month, and 3 months postoperatively, along with the Early Wound Healing Score (EHS) during the first week of postoperative healing. Patients and Methods: This single-blind randomized clinical trial included systemically healthy individuals aged 18-35 years with Nordland and Tarnow's Class I-III papillary loss. A total of 18 participants were randomly assigned to either the test group or the control group. Clinical parameters were measured pre- and postoperatively at specified intervals. Both groups received standard presurgical care and postoperative follow-up. The surgical protocol for the test group involved titanium insert placement in the interdental bone, while the control group received a connective tissue graft using the Han and Takei method. Results: Both groups showed significant intragroup improvements in all parameters from baseline to 1 and 3 months (p<0.05). However, intergroup comparisons showed no significant differences at most time points, except at 3 months for PPI, where the control group showed significantly better results (p=0.04). EHS scores did not differ significantly between the groups. Conclusion: Titanium inserts and CTG both demonstrated clinical effectiveness in enhancing interdental papilla dimensions. These findings support the titanium insert as a viable, less invasive alternative, offering clinicians a practical option for esthetic papilla reconstruction.
Xie, C.; Wang, Y.; Li, D.; Yu, B.; Peng, S.; Wu, L.; Yang, M.
Handheld ultrasound devices have revolutionized point-of-care diagnostics, but their effectiveness remains limited by operator dependency and the need for specialized training. This paper presents an intelligent guidance and diagnostic assistance system for the handheld wireless ultrasound device, enabling automated carotid artery and thyroid examinations through handheld operation. Drawing inspiration from the Actor-Critic framework, we implement a simulation-based reinforcement learning approach for real-time probe navigation toward standard anatomical views. The system integrates YOLOv8n-based detection networks for carotid plaque and thyroid nodule identification, achieving real-time inference at 30 frames per second. Furthermore, we propose a hybrid measurement approach combining UNet segmentation with the Snake algorithm for precise biometric quantification, including carotid intima-media thickness (IMT), lumen diameter, and lesion dimensions. Experimental validation on clinical datasets demonstrates that the proposed system achieves 91.2% accuracy in standard plane acquisition, 87.5% mean average precision (mAP) for plaque detection, and 89.3% mAP for nodule identification. Measurement results show excellent agreement with expert sonographers, with IMT measurements exhibiting a mean absolute difference of 0.08 mm. These findings demonstrate the feasibility of intelligent handheld ultrasound examination, significantly reducing operator dependency while maintaining diagnostic accuracy comparable to experienced clinicians.
Anderson, O.; Hung, R.; Fisher, S.; Weir, A.; Voisey, J. P.
Radiogenomics enables the non-invasive characterisation of the genomic and molecular properties of tumours, with epidermal growth factor receptor (EGFR) mutations in non-small cell lung cancer (NSCLC) being one of the most investigated applications. In this study, we evaluate radiomics, contrastive learning, and convolutional deep learning approaches to predict the EGFR mutation status from chest Computed Tomography (CT) images using the TCIA Radiogenomics dataset (n=115). Our results, using 10-fold cross validation, demonstrate the capacity of imaging models to predict mutation status from CT data in a manner consistent with existing literature. Among the evaluated methods, models integrating radiomic with clinical features achieved the best performance, with an AUC of 0.790 and AUPRC of 0.517, outperforming both contrastive learning (AUC=0.787) and convolutional architectures (AUC=0.763). Beyond methodological comparisons, we discuss the challenges related to clinical translation. Specifically, we contrast radiogenomics with conventional tissue biopsies, and identify scenarios where radiogenomics might be useful, either independently or in conjunction with other existing diagnostic technologies. Together these findings evidence the potential utility of radiogenomics EGFR models and provide direct architecture comparisons on the same dataset.
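As a reading aid for the AUC figures above, here is a minimal sketch of ROC-AUC computed as the probability that a randomly chosen positive case (for example, EGFR-mutant) receives a higher score than a randomly chosen negative case, with ties counted as one half. The function name and toy scores are hypothetical and are not the authors' pipeline.

```python
def roc_auc(labels, scores):
    """ROC-AUC via pairwise comparison of positive and negative scores.

    labels: iterable of 0/1 ground-truth values.
    scores: iterable of model scores (higher means more likely positive).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical toy example with four cases.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

The pairwise form is quadratic in the number of cases, which is acceptable for a cohort of about a hundred patients such as the one described here; in a cross-validation setting it would be applied to each held-out fold.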
Mahfouz, M.; Alzaben, E.
Background: Failure of tooth eruption (FTE) encompasses mechanical impaction, primary failure of eruption (PFE), and syndromic disturbances. Since the seminal review by Suri et al. (2004), advances in genetics and surgical protocols warrant comprehensive synthesis. Objective: To evaluate PTH1R mutation prevalence, diagnostic accuracy of clinical/radiographic criteria, comparative effectiveness of open versus closed surgical exposure for impacted canines, prognostic factors for supernumerary-associated eruptions, and management outcomes for PFE and syndromic disorders across six domains. Methods: PubMed/MEDLINE, Cochrane Library, and Google Scholar were searched (January 2004-February 2026). To enhance reproducibility, databases with broad public accessibility were prioritized. Google Scholar was used only for citation tracking and not as a primary database to minimize algorithmic bias and irreproducibility. PRISMA 2020 guidelines were followed. Protocol registered on OSF (DOI: 10.17605/OSF.IO/R5X76). Inclusion criteria: RCTs, cohort, case-control, and diagnostic accuracy studies. Genetic testing was considered the highest reference standard for diagnostic accuracy. Risk of bias assessed using ROBINS-I, QUADAS-2, and RoB 2.0. Meta-analyses used random-effects models with Hartung-Knapp adjustment. Heterogeneity was assessed using I² statistics, with sources explored through subgroup analyses, meta-regression, and prognostic factor analysis. GRADE evaluated evidence quality. Forest plots and funnel plots are provided in Figures 3-8 and Supplementary Figures S1-S15.
Figure 3: Forest Plot - Treatment Duration Difference (Closed vs. Open Exposure). Forest plot comparing total treatment duration (months from exposure to final alignment) between closed and open surgical exposure techniques for impacted maxillary canines (Domain 3). Data from 8 studies comprising 1,287 canines. Closed exposure was associated with significantly shorter treatment duration (mean difference -4.7 months; 95% CI: -7.3 to -2.1; p < 0.001). Heterogeneity was moderate to high (I² = 64.1%), partially explained by study design in meta-regression (RCTs vs. cohorts, p = 0.04). The 95% prediction interval (-9.8 to 0.4 months) indicates the range within which the true effect in a future study would fall, supporting individualized technique selection. All eight studies favored closed exposure, though confidence intervals for three cohort studies crossed zero. Study weights ranged from 4.0% to 18.2%. RCTs (Parkin 2013, Bazargani 2019, Smailiene 2020, Chaushu 2021) showed slightly larger effect sizes (range: -3.8 to -6.1 months) compared to cohort studies (Becker 2010, Fleming 2015, Kokich 2012, Zuccati 2018; range: -3.2 to -6.4 months). Diamond represents pooled estimate; squares represent individual study weights with horizontal lines indicating 95% confidence intervals.
Figure 8: Forest Plot - Spontaneous Eruption After Supernumerary Removal. Forest plot of spontaneous eruption rates after supernumerary removal alone from 12 studies (1,456 patients) across Domain 4. Reported rates ranged from 48% to 68% across studies (I² = 71.2%). High heterogeneity reflects differences in patient age (deciduous vs. mixed vs. permanent dentition), supernumerary morphology (conical vs. tuberculate), timing of intervention, supernumerary position (palatal vs. labial vs. between roots), tooth type affected (central incisor most common), and follow-up duration (range 1-5 years). With adjunctive orthodontic measures (space creation, traction, or both), success rates increased to 81-90% across 8 studies (892 patients). Study weights ranged from 8.4% to 8.9%. Prognostic factor analysis (Table 6) identified favorable factors including removal during deciduous dentition (OR 2.5-5.5), conical supernumerary morphology (OR 3.0-6.5), and incomplete root formation of the permanent incisor (OR 2.5-5.0). Unfavorable factors included tuberculate morphology (OR 0.2-0.4) and complete root formation (OR 0.2-0.5). Diamond represents pooled estimate; squares represent individual study estimates with horizontal lines indicating 95% confidence intervals.
Results: From 3,587 records, 94 studies (9,156 patients) were included across six domains. Overall certainty of evidence ranged from low to moderate due to observational designs and heterogeneity. Domain 1 (Genetic Basis): PTH1R mutation prevalence in PFE ranged from 52-90% (16 studies, 487 patients; I² = 68%; Figure 6). Heterogeneity reflected differences in familial vs. sporadic cases and referral bias. Population-level prevalence remains unknown. Sixty-three variants identified. Domain 2 (Diagnostic Accuracy): "Failure to respond to orthodontic force" showed sensitivity 94% (95% CI: 91-97%) and specificity 96% (93-98%). "Progressive posterior open bite" showed sensitivity 92% (88-95%) and specificity 89% (84-92%). Reference standard heterogeneity (I² = 45-65%) addressed through bivariate and HSROC models. CBCT provided superior root resorption detection (97% vs. 68%; p < 0.001). Domain 3 (Canine Impaction): Open (91% [88-94%]) and closed (93% [89-95%]) exposure achieved comparable success (I² = 52%). Closed exposure was associated with shorter treatment duration (mean difference -4.7 months [-7.3 to -2.1]; I² = 64%; Figure 3) and lower postoperative pain (-1.9 VAS [-2.6 to -1.2]; I² = 58%; Figure 4). Prediction intervals (-9.8 to 0.4 months) support individualized technique selection. Funnel plots showed no significant publication bias (Figure 7). Domain 4 (Supernumerary): Spontaneous eruption after removal alone: 48-68% (I² = 71%; Figure 8); with adjunctive orthodontics: 81-90%. Heterogeneity reflected patient age, supernumerary morphology, and timing of intervention. Favorable factors: deciduous removal (OR 2.5-5.5), conical morphology (OR 3.0-6.5), incomplete root formation (OR 2.5-5.0). Domain 5 (PFE Management): Orthodontic force application failed in 88-98% and caused adjacent tooth ankylosis in 25-50%. Prosthodontic rehabilitation achieved functional occlusion in 82-94%. Implant success: 85-95%. Meta-analysis not performed due to critical heterogeneity. Domain 6 (Syndromic): Cleidocranial dysplasia alignment: 61-75%. Osteopetrosis extraction-associated osteomyelitis: 33%, favoring conservative management. Narrative synthesis only.
Figure 6: Forest Plot - PTH1R Mutation Prevalence. Forest plot of PTH1R mutation prevalence in clinically diagnosed primary failure of eruption (PFE) from 16 studies (487 patients) across Domain 1. The reported prevalence varied substantially across studies, ranging from 52% to 90% (I² = 68%). Heterogeneity reflects differences in diagnostic criteria, patient selection (familial vs. sporadic cases), and referral bias. Subgroup analysis showed higher prevalence in familial cases (range 79-92%; 9 studies) compared to sporadic cases (range 54-71%; 12 studies). Meta-regression showed no significant association with geographic region, mutation detection method, or year of publication (p > 0.05 for all). Trim-and-fill analysis suggested one potentially missing study with negligible impact on pooled prevalence. Study weights ranged from 5.7% to 6.8%. The most frequently reported studies include Frazier-Bowers 2010 (0.75, 95% CI: 0.58-0.87), Risom 2013 (0.82, 95% CI: 0.66-0.92), and Park 2025 (0.89, 95% CI: 0.74-0.96). Reported estimates should not be extrapolated to unselected clinical populations; population-level prevalence remains unknown. Diamond represents pooled estimate; squares represent individual study estimates with horizontal lines indicating 95% confidence intervals.
Figure 4: Forest Plot - Postoperative Pain Difference (Closed vs. Open Exposure). Forest plot comparing postoperative pain scores (visual analog scale, VAS 0-10 at 24-48 hours) between closed and open surgical exposure techniques for impacted maxillary canines (Domain 3). Data from 5 studies comprising 842 patients. Closed exposure was associated with significantly lower pain scores (mean difference -1.9; 95% CI: -2.6 to -1.2; p < 0.001). Heterogeneity was moderate (I² = 58.2%), reflecting differences in pain measurement timing (24h vs. 48h), analgesic protocols, and study design (RCT vs. cohort). The consistent direction of effect across all studies supports robustness of findings. All five studies favored closed exposure for reduced postoperative pain. Study weights ranged from 17.5% to 22.4%. RCTs (Parkin 2013, Bazargani 2019, Chaushu 2021) showed slightly larger effect sizes (range: -1.8 to -2.4) compared to cohort studies (Becker 2010, Fleming 2015; range: -1.2 to -1.6). Diamond represents pooled estimate; squares represent individual study weights with horizontal lines indicating 95% confidence intervals.
Figure 7: Funnel Plot - Publication Bias for Canine Studies. Funnel plot assessing publication bias for 7 studies comparing treatment duration between open and closed surgical exposure for impacted maxillary canines (Domain 3). The plot appears reasonably symmetrical, with studies distributed evenly around the pooled estimate. Egger's test was non-significant (p = 0.38), suggesting no strong evidence of publication bias for this outcome. Each circle represents an individual study. The funnel shape represents the pseudo 95% confidence interval limits. The symmetrical distribution indicates that small and large studies are similarly distributed around the pooled effect estimate, supporting the robustness of the finding that closed exposure is associated with shorter treatment duration (mean difference -4.7 months; 95% CI: -7.3 to -2.1). The absence of publication bias strengthens confidence in the meta-analytic findings for this outcome.
Conclusions: These findings support a paradigm shift toward genetically informed orthodontic decision-making across six integrated domains. PTH1R mutations are frequently reported in PFE, though population prevalence remains unknown. Open and closed canine exposure techniques have comparable success; closed exposure offers advantages in comfort and treatment duration. Early supernumerary intervention improves outcomes. Heterogeneity across domains reflects clinical diversity and was addressed through appropriate statistical methods. Orthodontic forces should be avoided in confirmed PFE. Registration: Open Science Framework (DOI: 10.17605/OSF.IO/R5X76)
Krueger, D.; Binkley, N.; Madeira, M.; Chen, Z.; Di Gregorio, S.; Del Rio, L.; Humbert, L.
3D-DXA reconstructs DXA hip scans into 3-dimensional images, allowing measurement of trabecular and cortical bone parameters. Given the higher image quality of GE Healthcare iDXA than GE Healthcare Prodigy, it could be hypothesized that the reconstruction might differ, thereby affecting 3D-DXA results. The aim of the study was to assess agreement and precision of 3D-DXA cortical and trabecular femur parameters between Prodigy and iDXA densitometers in adult subjects. The study cohort was composed of 391 men and women recruited from 3 clinical centers (USA and Brazil). All subjects were scanned on either Prodigy or iDXA scanners. Short-term precision was assessed on two Prodigy and two iDXA densitometers. 3D-DXA analyses were performed using 3D-Shaper software version 2.14. Agreement between densitometers was assessed by regression and Bland-Altman analyses. Short-term precision was determined following International Society for Clinical Densitometry recommendations. Strong agreement for 3D-DXA parameters was obtained between devices regardless of the center or the DXA device model (all R² > 0.96). Bland-Altman analyses demonstrated statistically (p < 0.05), but not clinically, significant differences between both aBMD and 3D-DXA measurements obtained using Prodigy and iDXA scanners. Short-term precision of areal BMD and 3D-DXA parameters was similar between densitometers. This study demonstrated excellent 3D-DXA measurement agreement and similar precision between iDXA and Prodigy densitometers. These data provide evidence that no adjustments are required when using 3D-Shaper software on iDXA or Prodigy instruments. Mini Abstract: We assessed agreement and precision of 3D-DXA parameters between GE Healthcare Prodigy and iDXA densitometers in adults. Strong agreement was observed between devices, and short-term precision was comparable. Findings indicate that no adjustment is needed when using 3D-DXA with GE Healthcare densitometers.
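The Bland-Altman analysis mentioned in this abstract can be sketched as follows, assuming paired measurements of the same parameter from two devices. The function name, the 1.96 multiplier for 95% limits of agreement, and the toy values are illustrative assumptions rather than details taken from the study.

```python
from statistics import mean, stdev


def bland_altman(device_a, device_b):
    """Return (bias, lower limit, upper limit) for paired measurements.

    bias is the mean of the paired differences (a - b); the limits of
    agreement are bias +/- 1.96 sample standard deviations of the
    differences, the conventional 95% limits.
    """
    if len(device_a) != len(device_b) or len(device_a) < 2:
        raise ValueError("need at least two paired measurements")
    diffs = [a - b for a, b in zip(device_a, device_b)]
    bias = mean(diffs)
    sd = stdev(diffs)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical paired readings, not data from the study.
bias, lower, upper = bland_altman([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
print(bias, lower, upper)
```

A bias that differs statistically from zero can still be clinically negligible when the limits of agreement are narrow relative to the measurement's precision error, which is the distinction the abstract draws.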
Miyata, M.; Tomiyasu, M.; Sahara, Y.; Tsuchiya, H.; Maeda, T.; Tomoyori, N.; Kawashima, M.; Kishimoto, R.; Mizota, A.; Kudo, K.; Obata, T.
Purpose: Aqueous humor is known to drain from the eye not only via the conventional pathway through the trabecular meshwork and Schlemm's canal, but also via pathways through the posterior chamber and optic nerve to the cerebrospinal fluid (CSF) surrounding the optic nerve. The mechanism is poorly understood, and a non-invasive method for evaluation in living humans has not been established. We previously showed that eye drops containing O-17-labeled water (H₂¹⁷O) distribute in the anterior chamber but not the vitreous. This study aimed to evaluate the distribution of H₂¹⁷O in the CSF along the optic nerve. Methods: Five ophthalmologically normal participants (20-31 years, all females) were selected from a previous prospective study based on ¹H MR images of the eyes that included the optic nerve. They received eye drops of 10 mol% H₂¹⁷O in their right eye. A dynamic image time series was created by normalizing the signal of each ¹H-T2WI by the pre-drop average signal. Region-of-interest analyses were performed for signal changes in the anterior chamber, vitreous, and CSF. Results: In the quantitative evaluation, the normalized intensity in the anterior chamber and CSF was significantly lower than the pre-drop signal (anterior chamber: 0.78 ± 0.07, p < 0.005; CSF: 0.89 ± 0.07, p < 0.05). No distribution was identified in the vitreous. Qualitatively, the distribution of H₂¹⁷O in the anterior chamber was detected in all five participants and in the CSF of four participants (80%). Conclusion: H₂¹⁷O eye drops were distributed in the anterior chamber and CSF, but not in the vitreous. These findings suggest that aqueous humor outflow by routes other than Schlemm's canal, as visualized here, may contribute to ocular fluid homeostasis, including the ocular glymphatic system.
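The normalization step described in the Methods (dividing each image's signal by the pre-drop average) can be sketched for a single region of interest as below. The function name, the argument `n_pre` (number of pre-drop time points), and the toy signal values are hypothetical.

```python
from statistics import mean


def normalize_by_baseline(signals, n_pre):
    """Divide every time point by the mean of the first n_pre points.

    signals: ROI mean signal per time point, in acquisition order.
    n_pre:   number of leading time points acquired before the eye drop.
    A normalized value below 1.0 indicates signal loss relative to baseline.
    """
    if n_pre < 1 or n_pre > len(signals):
        raise ValueError("n_pre must be between 1 and len(signals)")
    baseline = mean(signals[:n_pre])
    if baseline == 0:
        raise ValueError("baseline signal is zero; cannot normalize")
    return [s / baseline for s in signals]


# Hypothetical ROI series: two pre-drop points, then two post-drop points.
print(normalize_by_baseline([100.0, 100.0, 80.0, 90.0], n_pre=2))
```

Normalized values such as the reported 0.78 (anterior chamber) and 0.89 (CSF) are read on this scale, where 1.0 corresponds to the pre-drop signal.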
Fisher, G. R.
In previous work, we achieved state-of-the-art performance on ChestX-ray14 (ROC-AUC 0.940, F1 0.821) using pretraining diversity and clinical metric optimization. Applying the same methodology to CheXpert, we obtained similar results when using NLP-labeled validation and test data--but when evaluated against expert radiologist labels, performance was only 0.75-0.87 ROC-AUC. The models had learned to match the automated NLP labeling system, not to diagnose disease. This paper documents our investigation into this failure and our suggested resolution. We identify the NLP-to-Expert generalization gap: a systematic divergence between models optimized on labels extracted from radiology reports and their agreement with board-certified radiologists. More surprisingly, we discovered that directly optimizing for small expert-labeled validation sets can be counterproductive--models with lower validation scores often generalized better to held-out expert test data. Four findings emerged: First, expert-labeled images for at least the validation and testing datasets, even if not for training, were vital in revealing the gap between NLP agreement and diagnostic accuracy. Without them, our models appeared excellent while failing to generalize to clinical judgment. Second, less training is better. Short training (1-5 epochs) outperformed extended training (60+ epochs) because longer training doesn't improve the model--it memorizes the labeler's mistakes. Third, ImageNet features are sufficient. Freezing the pretrained backbone and training only the classifier achieved 0.891 ROC-AUC--matching models with full fine-tuning. The rapid convergence we observed wasn't the model learning chest X-ray features; it was the classifier calibrating to already-sufficient visual representations. Fourth, regularization beats optimization. Label smoothing and frozen backbones--methods that prevent overfitting--outperformed direct metric optimization on small validation sets.
The 200 expert-labeled validation images in CheXpert are too few to optimize directly; they are better used as a compass than a target. With these insights, we improved from 0.823 to 0.917 ROC-AUC, exceeding Stanford's official baseline (0.907).
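To make the label-smoothing idea in this abstract concrete, here is a minimal sketch of one common binary variant, in which a hard label y becomes y(1 - eps) + eps/2. The abstract does not state which variant or smoothing strength was used, so the formula choice, the value eps = 0.1, and both function names are assumptions for illustration.

```python
from math import log


def smooth_binary_labels(labels, eps=0.1):
    """Pull hard 0/1 labels toward 0.5: y -> y * (1 - eps) + eps / 2."""
    if not 0.0 <= eps <= 1.0:
        raise ValueError("eps must be in [0, 1]")
    return [y * (1.0 - eps) + 0.5 * eps for y in labels]


def binary_cross_entropy(targets, probs, tiny=1e-12):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for t, p in zip(targets, probs):
        p = min(max(p, tiny), 1.0 - tiny)
        total -= t * log(p) + (1.0 - t) * log(1.0 - p)
    return total / len(targets)


# With a smoothed target, a very confident prediction is penalized more
# than with the hard target, which discourages fitting noisy labels exactly.
hard_loss = binary_cross_entropy([1.0], [0.999])
soft_loss = binary_cross_entropy(smooth_binary_labels([1.0]), [0.999])
print(hard_loss, soft_loss)
```

Because the smoothed target never reaches 0 or 1, the loss is minimized at a prediction short of full confidence, which is one way such regularization can limit memorization of labeling errors.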